Massive parallelization of serial inference algorithms for a complex 

generalized linear model 
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Abstract 

Following a series of high-profile drug safety disasters in recent years, many countries are 
redoubling their efforts to ensure the safety of licensed medical products. Large-scale observa- 
tional databases such as claims databases or electronic health record systems are attracting par- 
ticular attention in this regard, but present significant methodological and computational con- 
cerns. In this paper we show how high-performance statistical computation, including graphics 
processing units, relatively inexpensive highly parallel computing devices, can enable complex 
methods in large databases. We focus on optimization and massive parallelization of cyclic 
coordinate descent approaches to fit a conditioned generalized linear model involving tens of 
millions of observations and thousands of predictors in a Bayesian context. We find orders-of- 
magnitude improvement in overall run-time. Coordinate descent approaches are ubiquitous in 
high-dimensional statistics and the algorithms we propose open up exciting new methodological 
possibilities with the potential to significantly improve drug safety. 

1 Motivation and Background 

Increasing scientific, regulatory and public scrutiny focuses on the obligation of the medical com- 
munity, pharmaceutical industry and health authorities to ensure that marketed drugs have ac- 



ceptable benefit-risk profiles (Coplan et al. , 2011). Longitudinal observational databases provide 



time-stamped patient-level medical information, such as periods of drug exposure and dates of 
diagnoses, and are emerging as important data sources for associating the occurrence of adverse 
events (AEs) with specific drug use in the post-marketing setting once drugs are in wide-spread 
clinical use ( [Stang et al. 2010). Some relevant papers focusing on drug safety from observation 
databases include [Curtis et al.| p008|); |Jin et aL| (|2008D; p| (|2009D; [Noren et al.| (|2008D ; [Schneeweiss 



et al. (2009); Kulldorff et al. (2011) Typical examples of these observation databases encompass 



administrative medical claims databases and electronic health record systems, with larger claims 
databases containing upwards of 50 million lives with up to 10 years of data per life and exposure 



to 1000s of different drugs (Ryan et al., 2012). 



The scale of these massive databases presents compelling computational challenges when at- 
tempting to estimate the strength of association between each drug and each of several AEs, while 
appropriately accounting for covariates such as other concomitant drugs, patient demographics and 
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concurrent disease. Generalized linear models (GLMs) with unknown parameter regularization or 
Bayesian priors offer one thriving opportunity to estimate association strength while controlling 
for many other covariates (iMadigan et al. 2011). However, naive implementation even to find 



maximum a posteriori (MAP) point-estimates in standard statistical packages grind to an almost 
stand-still with millions of outcomes and thousands of predictors, and hope of estimating even poor 
measures of uncertainty on drug-specific association estimates vanishes. 

One usual approach to the computationally intensive task of statistical model fitting entertains 
distributing the work across a specialized and costly cluster or cloud of computers. This approach 
achieves most success when the distributed jobs consist of lengthy "embarrassingly parallel" (EP) 
operations, such as the independent simulation of whole Markov chains in MCMC ( Wilkinson| 
2006). However, the cluster with its distributed memory commands high communication latency 



between operations, rendering even MAP estimation in a GLM often unworkable, let alone esti- 
mation of second order terms such as standard errors. MAP estimation is generally an iterative 
algorithm, and the potentially parallizable work within each step is rarely sufficient to outweigh 
the communication latency and thread management costs. 

For massive parallelization that overcomes many of these issues, there exists a much less ex- 
pensive resource available in many desktop and laptop computers, the graphics processing unit 



(GPU); see, for example, Suchard et al. (2010) for a gentle introduction in statistical model fitting. 
GPUs are dedicated numerical processors originally designed for rendering 3-dimensional computer 
graphics. A GPU consists of tens to hundreds of processor cores on a single chip. These can be 
programmed to perform a sequence of numerical operations simultaneously to each element of a 
large data array. The acronym SPMD summarizes this single program, multiple data paradigm. 
Because the same operations, called kernels, function simultaneously, GPUs can achieve extremely 
high arithmetic intensity provided one can transfer input data and output results onto and off of 
the processors efficiently. Because the parallel threads driving the kernels operate on the same 
computer card, the cost of spawning and destroying millions of threads within each iterative step 
of the MAP estimation is neglible, and communication latency between threads is minimal. How- 
ever, statisticians have been slow to embrace the new technology, due in part to a preconception 



that GPUs work best with EP operations. To dispell these ideas, Silberstein et al. (2008) first 



demonstrated the potential for GPUs in fitting simple Bayesian networks. Recently, Suchard and 



Rambaut (2009) and Suchard et al. (2010) have seen greater than 100- fold speed-ups in MCMC 



simulations involving highly structured graphical models and mixture models. Lee et al. (2010) 



and Tibbits et al. (|2011 ) are following suit with Bayesian model fitting via particle filtering and 
slice sampling, and Zhou et al. (2010) demonstrate GPU utility for high- dimensional optimization. 



In this paper, we explore the use of GPU parallelization in fitting a real- world problem involving 
a penalized GLM to massive observation datasets with high-throughput computing needs. We 



entertain recent advances in a Bayesian self-controlled case series (BSCCS) model (Madigan et al. 



2011) and by exploiting the sparsity of the resulting database design matrix, we optimize a cyclic 



coordinate descent (CCD) algorithm to generate MAP estimates for this high-dimensional GLM. 
Given the substantial speed-up that optimization and the GPU afford, we provide for the first time 
rough estimates of the prior hyperparameters and limited measures of coefficient uncertainty. 



2 Methods 

GLMs assume subject outcomes arise from an exponential family distribution whose mean is a de- 
terministic function of an outcome-specific linear predictor (Nelder and Wedderburn, 1972). These 
models include, for example, logistic regression, Poisson regression and several survival models. 
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Figure 1: Representative drug exposure and adverse event (myocardial infarction, MI) history for 
one 78 year-old male. In constant exposure era fc = 4, this subject suffers an MI (yj^4 = 1) and is 
taking Vioxx, Olanzapine and Celecoxib {xi^^^y = xi_4^o = ^j,4,c = !)• In era k = 10, this subject 
suffers two Mis (yi,io = 2) and is taking Quetiapine (xj^icq = 1). 



Often, study designs necessiate matching subjects or conditioning on sufficient statistics of the 
generative distribution to infer the relative effects of predictors; this leads to complex GLMs with 
likelihood-functions that grow computationally expensive in massive datasets. We explore one such 
model as a case-study in optimization and parallelization. 

2.1 Bayesian Self-Controlled Case Series Model 



Farrington (1995) proposed the self- controlled case series (SCCS) method in order to estimate 
the relative incidence of AEs to assess vaccine safety. SCCS provides a cases-only (subjects with 
at least one AE) design that automatically controls for time-fixed covariates. Since each subject 
serves as her own control, all individual-specific effects drop out of the model likelihood and the 
method compares AE rates between exposed and unexposed time-intervals through an underlying 
inhomogeneous Poisson process assumption. 

Suppose an observational database tracks the drug exposure and AE history of i = 1, . . . ,N 
subjects who each experience at least one AE. Figure [T] cartoons such a history. During observation, 
subjects start and stop individual regiments of j = 1, . . . , J possible drugs accumulated across 
all subjects. These regiments partition each subject's observational period into k = l,...,Ki 
eras, during which drug exposure is (assumed to be) constant. Drug exposure indicators Xj^ = 
(xiki, . . . , Xikj)' and time-lengths kk (generally recorded in days) fully characterize each era for each 
subject, where Xikj = 1 if exposed to drug j or otherwise 0. Finally, yik counts the number of AEs 
for subject i during era k and, for convenience, we group = {yn, . . . , yiKi)' i Xj = (xji, . . . , 'X-iKi)' 
and \i = {ki, . . .,liKj- 

A SCCS assumes that these AEs arise according to an inhomogeneous Poisson process, where 
a subject baseline effect e^' and drug exposures multiplicatively modulate the underlying instanta- 
neous event intensity Xik = e'^^^^i^f^ for subject i during era k. Here, /3 = (/3i, . . . , /3j)' are unknown 
relative risks attributable to each drug. Consequentially, yik ~ Poisson(Ajfc). Due to computa- 
tional demand, researchers have typically applied this model to study only one potential exposure 



at a time, ignoring correlation between drugs. Madigan et al. (2011) provide further details on the 



development of the multivariate SCCS involving multiple drugs simultaneously as used here. 

In order to avoid estimating parameters (pi for all i, the SCCS method conditions on their 
sufficient statistics. Under the Poisson assumptions, these sufficient statistics are the total number 
of AEs Hi = YlkVik ^ subject experiences over her observation period, yielding the model 
likelihood contribution for each subject, 




Pr(y.|x.,nO = p^^^^^^^^ oc||[ _ . J . (1) 
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Naturally, it is also clear from the likelihood expression that if a subject experiences no AEs (n^ = 0), 
then that subject does not contribute to the model likelihood, providing a cases-only design. 
Taking all subjects as independent, we write the log-likelihood as a function of unknown /3 as 



N 



i=l 



^ {Vik x'ikP) - log ^ kk 



.k=l 



\k=l 



(2) 



This likelihood furnishes a complex GLM that carries a high computational burden, arising from 
the conditioning and renormalization. In practice, one needs to keep track of the sum of a large 
number of terms for each subject and each term requires exponentiation and then weighting. Such 
a burden quickly grows prohibitive for the millions of cases available in observational databases. 



Priors In drug safety surveillance there exist thousands of potential drugs. This high dimension- 
ality can lead to severe overfitting under the usual maximum likelihood approach, even for massive 
datasets, so regularization remains necessary. As an alternative, we adopt a Bayesian approach 



by assuming a prior p{P) over the drug effect parameter vector, constructing a BSCCS (Madigan 



et al. , 2011) and performing inference based on posterior mode estimates. We refer interested 



readers to Kyung et al. (2010) for a more in-depth discussion of the connections between penalized 
regression and some Bayesian models. 

To develop p{f3), we naturally assume that most drugs have no appreciable effect and consider 
distributions that shrink the parameter estimates toward or to to also address overfitting. We 
focus on two choices: 

(3j ~ Normal(0, cr^) or /3j ~ Laplace ( 0, cr^) (3) 

for all J, where a"^ is the unknown hyperparameter variance of each distribution. Under the Normal 
prior, finding the posterior mode estimates is analogous to ridge conditioned-Poisson regression with 
its L2-norm constraint on (3, and, under the Laplacian prior, we achieve a lasso conditioned-Poisson 



regression with its Li-norm constraint (Tibshirani, 1996) 



2.2 Maximum A Posteriori Estimation using Cyclic Coordinate Descent 



CCD algorithms (d'Esopo, 1959; Warga, 1963) for fitting generalized linear models with Li or 
L2 regularization priors come in many flavors (Wu and Lange, 2008). The overarching theme of 



these algorithms promotes forming a fixed or random cycle over the regression parameters /3 and 
updating one element /3j at a time, achieving after iteration their maximum a posteriori estimates 
/^MAP = • • • ) /3j)- These updates require evaluating the log posterior gradient ^P{(3) and 
Hessian ^P{P), where P(/3) = L{/3) +logp(/3), along a single dimension only, and thus avoid 



the "Achilles heel" (Wu et al. , 2009) of the more standard multivariate Newton's method that 
necessiates inverting the complete and high-dimensional Hessian at each iteration. 

Within the cycle, CCD implementations often differ in the size of the one-dimensional step 
A/3j they take. The traditional algorithm proposes iterating one-dimensional Newton's method 
updates to convergence. Others consider a single-step update based (sometimes loosely) on New- 
ton's method, where one bounds the second derivates or APj directly to ensure a descent property 



and minimize algebraic work ( 


Lange 


1995 


Genkin et al. 


2007 


Wu and Lange 


2008) 



overhead of monitoring for convergence of the single-parameter Newton's methods. 

To explore MAP estimation via CCD for the BSSCS model, we follow the success of Genkin 
et al. (2007) and employ an adaptable trust-region bound on A/3j, where the unbounded A/3j 
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follows from a single application of Newton's method (Zhang and Oles, 2001| ): 



^ 4[L(/3) + logp(/3)] ^ g,(/3) + 4-logp(/3) 
' ^ [L(/3) + logp (/3)] /i, (/3) + ^ logp (/3) ■ 



We outline the complete fitting procedure in Algorithm [Tj Following Genkin et al. ( 2007 ) , we 



declare convergence when the sum of the absolute change in X/3 from successive iterations falls 
below e = 0.0005. The preceeding approach has been effective in fitting the BSSCS model to 



modest datasets (Simpson, 2011). Our present inquire in what follows attempts to extend this 



success to massive observation databases. 



Algorithm 1 Cyclic coordinate descent algorithm for fitting Bayesian self-controlled case series 
model. Computationally demanding steps are highlighted as targets for parallelization. While all 
variables are defined in the text, we identify f3 here as the J-dimensional regression coefficients over 
which we wish to maximize the log-posterior in this algorithm. 

1: Initiahze: (3 = which imphes [X/3] = 0, [L x exp (X/3)] = L, and M [L x exp (X/3)] = ML 
2: Initialize: outer iteration counter t = 1 
3: repeat 

4: for inner cycle j = 1 to J do 

5: Compute unidirectional gradient gj{P) and Hessian hj{j3) (target for parallelization) 
6: Compute A/3j given gj{f3), hj{(3) and derivatives of prior p {(3) 
7: if Af3j ^ then 
8: (3^f3+ (A^j) ej 

9: Update [X/3], [L x exp X/3] and M[L x exp X/3] (target for parallelization) 

10: end if 
11: end for 
12: Update t + 1 
13: until convergence in X/3 occurs 
14: Report: fB^j^p and maximized log-posterior P{Pmap) 



2.3 Computational Work 



To gain a handle on the computational work involved in fitting the BSCCS model, let K = Ki 
count the total number of (unique within subject) exposure eras across all subjects. Define X = 
vec (x^^) as the sparse K x J design matrix that consists solely of and 1 entries, indicating if drug 
j contributes to each of the K patient/exposure-era rows. Likewise, form Y = {yu, . . . jUnKmY 
and L = (Zn, . . . , InKnY if-dimensional column vectors, N = (ni, . . . , tin)' as an A^-dimensional 
column vector and M as a sparse N x K loading matrix with entries 

Mik = l ^ ^""^ ^^'"i ^"^ < ^ - Sm=i ^™ (5) 
* \ otherwise. 

Making these substitutions into Equation ([2]), we achieve 

L (/3) = Y'X/3 - N' log {M [L x exp (X/3)]} , (6) 

where we have defined multiplication (x), exponentiation (exp) and forming the logarithm (log) of 
a column-vector as element-wise operations. It remains possible to avoid the Hadamard product 
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definition of element-wise multiplication and, to come shortly, division (/) in favor of standard 
matrix-multiplication and matrix-inversion by exploiting a reparameterization of the loading matrix 
and diagonal matrices. However, their use belies the simplicity of the element-wise, and hence highly 
parallelizable, operations we encounter in practice in computing the unidimensional gradients and 
Hessians. Differentiating L with respect to Pj returns the necessary unidimensional gradient 

r)T 

where 

^ M[Lxexp(X/3)xX,-] 

M [L X exp (X/3)] ' ^ ' 

and vector Xj is the column of X. Likewise, the relevant entry of the Hessian matrix falls out 
as 

^.•(/3) = = -N'[Wx(l-W)]. (9) 



2.4 Targets for Parallelization 

CCD, along with most forms of statistical optimization and Markov chain Monte Carlo, is an 
inherently serial algorithm. As reminded in Algorithm [T| even within a iteration t, one cycles over 
parameters j to update and work cannot begin computing the next parameter update until the 
current update completes. Such algorithms do not immediately appear amenable to parallelization. 
However, all is not lost when one considers the proportion of computational work performed within 
each update to the computational overhead of the serial component. CCD carries a surprisingly 
light-weight serial component, and for the BSCCS model applied to even the smallest observational 
database described below, over 99.5% of the run-time lies in computing gj{(3) and hj{(3) alone. 

To provide insight for readers who wish to explore massive parallelization in their own ap- 
plications, we study the computational complexity of evaluating gj{(3) and hj(P) and, in the 
process, identify likely targets for optimization and parallelization. Common to both gj{P) and 
hj{(3) is W; hence efficient computation proceeds via first evaluating M [L x exp(X/3) x Xj] and 
M [L X exp(X/3)] that comprise W. To compute these, we 

1. Update [X/3] - given X/3 and A/3j_i from the previous iteration, X/3 X/3 + A/3j_iXj_i. 
When X is dense, the serial complexity of this operation is O (K). For sparse X, the worst- 
case complexity decreases to O (X^ax) where X^^x is the maximum of # (Xj) over j = 1, . . . , J 
and ^ (Xj) counts the number of non-zero entries in Xj. In general, X^^^ <^ K. 

2. Evaluate or update [L x exp(X/3)] - while this is also a X- dimensional vector, only the 
elements for which Xj_i are non-zero have changed; therefore, this step either re-evaluates 
all elements with O (K) or updates a few elements in O (X^^^). In both cases, the scaling 
constant is large because computing exp(x) requires 10s to 100s of times longer than elementry 
floating point operations. 

3. Evaluate or update M [L x exp(X/3)] - for dense X, this a sparse-matrix/dense- vector mul- 
tiplication; # (M) = K, achieving O (K). When X is sparse, it remains faster to update just 
the affected elements in O (X^^^); see Listing |3] for details of this update. 
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4. Evaluate M [L x exp(X/3) x Xj] - here we find either a sparse- matrix/dense- vector multiph- 
cation or an unusual sparse-matrix/sparse- vector multiplication with worst-case complexity 

O (Xmax) 

Steps ([T]) through ([s]) depend on Xj_i and convenience suggests performing these steps at the end 
of the previous iteration to reduce book-keeping. We illustrate this point in Algorithm [T] Also 
noted in Algorithm [T] is the observation that these steps only need envoking when Af3j ^ 0. For 
the Laplacian prior with its discontinuity at 0, Af3j = occurs regularly. 



Exploiting Sparsity Starting with Step ([T]) above, a very naive implementation recomputes the 
matrix-vector product at each cycle with complexity O (KJ) and the potential to drive run-time to 
a dead-lock. Zhang and Oles (2001) and IWu et al. (2009) independently identify the savings that 



the one-dimensional update affords here. These works, along with Genkin et al. (2007), exploit 
the sparsity of X in updating the dense ii'-dimensional vector X/3 (Step[T]). For comparison, these 
papers refer to X/3 as (ri, . . . , rjsf). However, we are unaware of others who continue to exploit the 
sparsity of X in moving from X/3 through to the subject-specific components of the gradient and 
Hessian (Steps [2]- 14]). 

With the numerator and denominator components of W in hand, we form a simple element- wise 
transformation and take two simultaneous inner products to return N'W and N' [W x (1 — W)]. 
We discuss the advantages of these fused reductions for both host CPUs and GPUs shortly. This 
operation carries worst-case serial complexity O (N) when all subjects have at least one exposure era 
that the current drug influences. Since several drugs carry prevalences among subjects nearing 25 - 
50%, keeping track of the non-zero elements in W and performing sparse operations often reduces 
efficiency given the extra over-head and irregular memory access. In either case, the work in this 
fused-reduction is far greater than the work required to compute the one-dimensional gradient and 
Hessian contributions of the prior on /3j, so we leave the prior details to the reader. Finally, after 
cycling over all drugs j, we evaluate the convergence criterion. Whether one checks the change 
in X/3 (Genkin et al. , 2007) or in the log-posterior, these computations remain a daunting O {K). 
Fortunately, we only envoke them once per complete cycle and this task's run-time becomes nearly 
irrelevant for moderate J. 



Fine-Scale Parallelization From these computational complexities, we immediately identify 
that Step ([2]) dominates run-time when X is dense at O (K) and with a very large scalar constant. 
On the other hand, for sparse X, the fused reduction at O (N) trumps run-time. Fortunately, these 
operations, along with the other update steps, are prime targets for parallelization using GPUs. 

Code Listing [1] presents a basic CUDA kernel to perform the dense computation of Step ([2]). 
To envoke this kernel, the host program requests the short-lived execution of K threads, one for 
each computed element. Unlike coarser-scale parallelization using MPI across clusters or even 
multi-core approaches, the cost of creating and destroying threads is often negligible for GPUs; 
this makes such fine-scale parallelization ideal for embedding within serial algorithms. While each 
thread executes independently in this kernel as there is no shared data between threads, we still 
group threads into relatively large thread-blocks of size, say, 8 x 16 or 16 x 16. For all NVIDIA 
hardware, 16 sequential global memory read/writes "coalesce" into a single transaction, and the 
GPU interleaves the execution of multiple 16-thread sets in the same block to hide transaction 
latency. While recent GPUs relax the sequential requirement modestly, both processes significantly 
decrease memory-bandwidth limitations, improving arithmetic throughput. All of the work of this 
kernel falls in a single line of code. The theoretical complexity of this operation in parallel reduces 
to O (1); however, in practice, one achieves O (K/C) where C counts the number of GPU processing 
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Listing 1: Dense CUDA kernel for element-wise evaluation of L x exp(X/3) given L and X/3 



1 __global__ void evaluateLExpXBeta ( 

2 real* LExpXBeta , const int* L, 

3 const real* XBeta , int K) { 

4 // Determine element index for this thread 

5 int idx = blockldx.x * blockDini.x + threadldx.x; 

6 // Perform scalar operation on each vector element 

7 if (idx < K) { 

8 // All coalesced memory access 

9 LExpXBeta [ idx ] = L[idx] * exp (XBcta [ idx ] ) ; 
} 

11 } 



cores available to the host. This quantity can range from the low-lOs on an integrated GPU in 
a mobile device or laptop to the mid-lOOs on a dedicated GPU card in a desktop through to the 
low- 1000s on multiple GPU devices attached to a single host. 

Fused Operations We turn our attention to the element-wise transformation and simultaneous 
inner products that provide a noteworthy example of effective optimization for massive datasets. 
Fusing these steps into a single operation avoids explicitly forming W, writing to its location in 
memory N times and then immediately reading from it A'^ or 2N times, depending on if we per- 
form the inner products simultaneously or separately, for each cycle step. Further, we only need 
to read from N once during the simultaneous reduction. This can significantly reduce memory 
bandwidth requirements on both the host CPU and GPU. Memory bandwidth measures the rate 
at which the processor can read data from or store data to memory. With increasingly faster pro- 
cessor speeds, many algorithms in statistics are memory bandwidth-limited rather than arthimetic 
throughput-limited. These optimization strategies fall under the name of "lazy evaluation" to "re- 
duce temporaries" and warrant greater recognition among computational statisticians who provide 
high-performance tools. Modern computing language compilers are very proficient at optimizing 
away such intermediates for scalar operations, but often fail for vectors since large vectors cannot 
be held entirely in processor registers. For CPU-computing, high-level linear algebra libraries, such 
as Eigen, adeptly reduce the use of vector and matrix temporaries and provide lazy evaluation 



through "expression templates" (Veldhuizen, 1995). For the GPU, the Thrust library furnishes 



expression templates for fusing a univariate-to-univariate transformation and reduction of a single 
input vector, and we highly recommend this tool. However, currently, statisticians are hardpressed 
to find a GPU library that takes two input vectors, performs a bivariate-to-bivariate transforma- 
tion and simultaneously reduces both output vectors. While such functionality may initially appear 
convoluted, it finds use in efficiently computing the one-dimensional gradient and Hessian for any 
GLM with minor modification to the transformation. To this end, we hand-craft our own. 

Listing [2] presents our fused CUDA kernel. To envoke this kernel, the host program requests 
the execution of a moderate number (PARTI AL_SUM < 64) of thread-blocks in which each block 
drives 256 or more (BLOCK_SIZE) threads depending on the hardware. In parallel, each thread 
begins by looping over A/(PARTIAL_SUM x BLOCK.SIZE) elements in M [L x exp(X/3) x Xj], 
M [L X exp(X/3)] and N, forming their transform and accumulating both inner product contri- 
butions for these elements. We interleave which elements each thread visits to coalesce memory 
transactions. Once completed, the threads within a block exploit the block's shared memory to 



perform two generic O (log BLOCK_SIZE) tree-based parallel reductions (Harris 2010). While lim- 
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Listing 2: Fused CUDA kernel for transformation and reduction of numerators 
M [L X exp(X/3) X Xj], denominators M [L x exp(X/3)] and N to partial-sums of gj{P) and 
hj{l3). Partial-sums end in length PARTI AL_SUM and we further reduce these on the host for 
efficiency. Function paralleReduction performs a generic logarithimic-order reduction in shared 

memory. 



1 __global__ void fusedTransformationAndReduction ( 

2 const real* Numerator, const real* Denominator, const int* N, 

3 real* Gradient, real* Hessian, int length) { 

4 

5 // Define shared memory for thread-block reduction 

6 __shared__ real sGradient [PARTTALJSUM] , sHessian [PARTIALJSUM] ; 

7 

8 // Partial sums for this thread 

9 real tSumGradient = 0.0, tSumHessian = 0.0; 

10 

11 // Determine first element index for this thread 

12 int idx = blockldx.x * PARTIAL^UM + threadldx.x; 

13 while (idx < length) { // Each thread processes multiple entries 

14 

15 / / Do transform of this entry and add to local thread sum 

16 real ratio = Numerator [ idx ] / Denominator [ idx ] ; // Coalesced memory access 

17 intn=N[idx]; // Coalesced memory access 

18 real tGradient = n * ratio ; 

19 tSumGradient += tGradient ; 

20 tSumHessian -|-= tGradient * (1.0 — ratio); 

21 

22 idx -|-= PARTIAL_SUM * gridDim.x; // Index next element for thread 

23 } 
24 

25 // Reduce across all threads in block, leaves total in first element of shared memory 

26 parallelReduction ( sGradient , tSumGradient); 

27 parallelReduction ( sHessian , tSumHessian); 

28 

29 // Only one thread writes block result 

30 if (threadldx.x = 0) { 

31 Gradient [ blockldx . x] = sGradient [ ] ; 

32 Hessian [ blockldx . x] = sHessian [0]; 
} 

34 } 
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ited in quantity, shared memory on a GPU sports orders of magnitude faster access time than 
global memory and is accessible by all threads in the same block during their execution. Shared 
memory enables threads to conveniently share data, such as is required in the tree-reduction. The 
CUDA software development kit (SDK) furnishes several examples of tree-based parallel reduc- 
tions. The output from this kernel are sets of partial-sums for gj{(3) and hj{P), each of length 
PARTIAL.SUM <^ N. Instead of envoking a second round of parallel reduction on these partial- 
sums, we perform the final work in series on the host. Because of high communication latency 
between the host and GPU device, it takes comparable time to transfer 2 floating-point values 
as the modest 2 x PARTIAL.SUM. CPU/GPU work-balance will be a hallmark for speed-efficient 
statistical fitting of massive datasets. 

In terms of work-balance, while the fused reduction is the rate limiting step for sparse X on 
both the CPU and GPU, we can significantly decrease the communication latency between the 
host and GPU by additionally off-loading all of Steps ^ through Q to the GPU well. Instead 
of uploading 2 x N fioating-point numbers to the GPU in each cycle step, we succeed in reducing 
this number to a single floating-point A/3j. The cost, of course, is additional programming and the 
need for performing sparse operations on the GPU. 



Representation in Memory Memory access is often irregular for sparse linear algebra, and 
the computational statistician needs to pay particular attention to how both sparse matrices and 



vectors are represented in memory (Bell and Garland 2009: Baskaran and Bordawekar, 2009). 



For example, the optimal representations for X and N differ. Only single columns of X enter 
into Steps ([l]) and Q at a time, highlighting the need for compressed column storage (CCS) in 
which one places consecutive non-zero elements of each column Xj into adjecent memory addresses. 
While the standard representational choice for sparse- matrix/dense- vector (spMV) multiplication 
is compressed row storage (CRS), naive CRS representation of X would be detrimental to run-time 
on computing hardware with limited low-level caches, such as CPUs, since CRS is designed for row- 
by-row access. On the other hand, loading matrix M does enter into W as a spMV multiplication 
operation when X is dense, suggesting CRS. For sparse X, #(Xj) <^ ^^(M), so precomputing 
MXj for all j in coordinate (COO) representation is simple and effective. COO representation 
consists of two index arrays, one to hold the row-indicators and one to hold the column-indicators 
of the non-zero entries, held in a third value array. Here the column-indicators of MXj conveniently 
are the same as the column-indicators of Xj. Stored as a structure of arrays (SoAs), memory access 
to the row- and column-indicators is sequential, well-cached on the host CPU and coalesced on the 
GPU. However, retrieving the individual elements of L x exp(X/3) remains irregular. Finally, the 
non-zero entries of both X and N are all one, so they need not be stored. 

Executing an independent thread for each non-zero element of Xj to update M [L x exp(X/3) x Xj] 
may result in race conditions when multiple threads attempt memory transactions on the same ele- 
ments in M [L X exp(X/3) x Xj] in global memory. One solution entertains launching one or more 
cooperative threads tied to each of the N rows in MXj. This may generate large inefficiencies as 
many rows in MXj contain only zeros. Alternatively, the last few generations of GPUs contain 
small on-processor memory caches that enable relatively quick "atomic" transactions, in which only 
one thread may access a specific address in global memory at a time. This avoids race conditions 
and allows us to fuse the sparse updates of Steps ([T]) through ^ together into a single kernel. 
Listing [3] presents our sparse CUDA kernel to update X/3, L x exp(X/3), and M [L x exp(X/3)] 
given A/3j. We envoke this kernel with one thread per non-zero entry in Xj, grouping threads into 
large blocks to help hide memory latency. 



10 



Listing 3: Sparse CUDA kernel for updating X/3, L x exp(X/3), and M [L x exp(X/3)] given Xj 

and A/3j 

__global__ void sparseUpdate ( 

real* XBeta , real* LExpXBeta, real* NLExpXBeta, 

const int* L, const int* sparse_rows , const int* sparse.columns , 
real DeltaBeta , int NumNonZeros) { 

/ / Determine sparse element index for this thread 

int idx = blockldx.x * blockDim.x + threadldx.x; 

if (idx < NumNonZeros) { 

int n = rows [idx]; // Coalesced memory access 

int k = columns [ idx ] ; // Coalesced memory access 

// Read sparse elements, many non-coalesced 
real oldXBeta = XBeta [k]; 
real oldLExpXBeta = LExpXBeta [ k ] ; 
int Lk = L[k] ; 

// Compute new values 

real newXBeta = oldXBeta + DeltaBeta; 
real newLExpXBeta = Lk * exp (newXBeta ) ; 

// Write sparse elements, many non-coalesced 
XBeta [k] = newXBeta; 
LExpXBeta [k] = newLExpXBeta; 

/ / Enforce single thread access at any time 

atomicAdd(&NLExpXBeta[n] , (newLExpXBeta - oldLExpXBeta)); 

} 

} 
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Precision Graphics rendering traditionally requires at most 32-bit (single precision) floating- 
point computation to encompass 8-bits of red, green, blue and alpha. Ensuingly, GPU performance 
remains greatest at single precision. While the latest generations of GPUs can operate with 64- 
bit (double precision) numbers, the precision boost comes with a performance cost because the 
GPU contains fewer double precision arthimetic logical units, resulting in approximately half the 
maximum floating-point operations per second. Further, double precision mandates reading and 
writing twice as much information. For fitting the BSCCS model to massive datasets, single preci- 
sion arthimetic suffices; the computations do not involve substracting approximately equal quanti- 
ties, nor multiplying small quantities, both of which may lead to underfiow . To d emonstrate this 
point, finding /9map BSCCS is a convex optimization problem (Simpson, 2011). Subsequently, 
hj{f3) < and all elements of [W x (1 — W)] > 0. Likewise, all elements of L, exp (X/3) and, 
therefore, W > 0. 



2.5 Hyperparameter Selection and Measures of Coefficient Uncertainty 

We aim to provide a full Bayesian analysis of all unknown parameters in our model. However, 
at present this remains beyond our computational limits. As a stopgap solution, we borrow two 
frequentist Monte Carlo proceedures. We learn about the hyperparameter o"^ through a 10- 
fold cross-validation scheme. Previously, the computational cost of fitting the BSCCS model was 
too great to consider the hyperparameter as random, requiring an arbitrarily fixed value. Under 
this cross-validation scheme, we randomly separate the cases-only dataset into 10 portions, fit the 
BSCCS model via CCD on 10—1 of these portions and compute the predictive log-likelihood L (/3) of 
the remaining portion given the fit. We repeat this process across a log-scale grid of hyperparameter 
values and chose the hyperparameter that maximizes the predictive log-likelihood. To moderately 
reduce the number of iterations required to acheive convergence of the CCD algorithm for successive 
hyperparameter values, we order the grid values from smallest to largest prior variance and exploit 
a series of "warm-starts." At small variance under the Laplacian prior, most coefficients shrink to 
and only slowly enter into the regression as the variance increases (Wu et al. , 2009). Under the 



warm-start, the maximized regression coefficients from the previous fit serve as starting values for 
the next fit. In general, the predictive log-likelihood surface is relatively flat in the region around its 
maximum, so precise estimation of the hyperparameter is unnecessary. Alternative maximization 
strategies involving an initial bracketing of the maximized predictive log-likelihood and an intervaled 
line search often yield more precise estimates in fewer evaluations of the predictive log-likelihood. 

Along with the infeasibility of estimating the hyperparameter, generating measures of uncer- 
tainty on the regression coefficients has remained taxing, to say the least, for the BSCCS model 
applied to massive observational databases. As a first attack at this problem, we examine the 
non-parametric bootstrap (Efron and Tibshirani, 1986) in the context of a Li regularized GLM 



(Park and Hastie, 2007). Procedures for generating standard errors for parameter estimates in the 



context of Li or L2 regularization that are both computationally efficient and theoretically well- 
supported remain out of reach. The simple non-parametric bootstrap approach we pursue here 



has some short-comings (see Chatterjee and Lahiri (2011) for a related discussion in the context 



of linear regression), but we view it as a pragmatic approach pending a more complete solution. 
Without getting embroiled in this discussion, we report 95% confidence intervals derived from the 
2.5% and 97.5% quantiles of 200 bootstrap samples. Under the Laplacian prior and as a "poor 
man's estimate" of the posterior probability that f3j 7^ 0, we also report pj for each drug, the 
observed bootstrap proportion in which Pj achieves a non-zero MAP estimate. 
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3 Demonstration 



We examine the computational performance of fitting the BSCCS model across several large-scale 
observational databases and AEs. In particular, we show results from two medical claims databases 
and acute liver injury, acute renal failure, bleeding and upper gestrointestinal tract ucler hospital- 
ization events in order to provide an cxamplar range of dataset sizes. Our results explore the 
effects of optimization and parallclization and are not meant here to identify medical products 
associated with these events. The MarketScan"^^ Commercial Claims and Encounters (CCAE) 
Research Database from Thomson Reuters is a large administrative claims database containing 
59 million privately insured lives and provides patient-level deidentificd data from inpatient and 
outpatient visits and pharmacy claims of multiple large employers. The MarketScan Lab Database 
(MSLR) contains 1.5 million persons representing a largely privately-insured population, with ad- 
ministrative claims from inpatient, outpatient, and pharmacy services supplemented by laboratory 
results. These databases constitute part of the data community established within the the Obser- 
vational Medical Outcomes Partnership (OMOP). OMOP is a public-private partnership between 
government, industry and academia to conduct methodological research to inform the appropriate 
use of observational healthcare data for active medical product surveillance. 

These example datascts span N = 115K to 3.6M cases-only patients taking J = 1224 to 1428 
different drugs. The datasets provide K = 3.8M to 75M total (unique) exposure eras per analysis. 
We perform all benchmarking on the Amazon Elastic Compute Cloud, exploiting an Intel Xeon 
X5570 CPU @ 2.93GHz and one NVIDIA Tesla C2050. This GPU device sports 448 cores @ 
1.15GHz. Performance on less expensive, commodity-grade CPUs, such as the NVIDIA GTX580, 
is often greater due to a slightly larger number of cores per GPU and higher memory-bandwidth. 
Due to data licensing agreements, however, we are restricted to Amazon hardware. 
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Figure 2: Maximum a posteriori estimation for several observational databases under the Bayesian 
self-controlled cases series model. We provide run-times for three implementations: dense com- 
putation on the CPU (white circles), sparse computation on the CPU (black circles) and sparse 
computation on the GPU (black squares). 
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Figure [2] presents the relative speed-up our algorithms enjoy when inferring MAP estimates. 
These gains first compare implementing Steps[2]-|4]as sparse operations and then porting computing 
to the GPU. Sparsity generates up to a 181-fold speed-up; while the GPU multiplies this by up 
to another 37-fold. To put these times on an absolute scale, MAP estimation for our largest 
dataset originally drained over 51 hours; sparse operations on the GPU reduce this time to 29 
seconds. Naturally, with fitting times standing in the 10s of hours, the hopes for cross-validation 
or bootstrap remain low, but grow very practical at 10s of seconds per replicate. 




Figure 3: Exemplar uncertainty analysis of angioedema as an adverse event under the Li prior. 
Here, we plot the non-parametric bootstrap 95% confidence intervals for the 441 drug effects that 
demonstrated non-zero coefficients in at least 50% of the bootstrap replicates. Gray-scaling reports 
the proportion of bootstrap replicates in which effect estimates are non-zero. 

Cross-validation to learn the hyperparameter o"^ across these four datasets returns optimal 
variances ranging from 0.05 to 0.15 for Li and 0.02 to 0.13 for L2. Importantly, these ranges are 
approximately an order-of-magnitude smaller than the arbitrary fixed value previously assumed 
in our BSCCS studies for drug surveillance. Employing the optimal hyperparameter, Figure [3] 
reports non-parameteric bootstrap confidence intervals of drug effects for a single representative 
dataset under the Li prior. This dataset explores angioedema events within the CCAE database 
and contains N = 76K case-only patients, taking J = 1162 drugs and yielding K = 2.1M exposure 
eras. In the figure, we first rank all drugs by their MAP estimate f3j in decreasing order and then 
plot the 95% confidence intervals for the 441 drugs for which pj > 0.50. Darker interval shading 
reflects larger pj. While a general trend holds in which larger \f3j\ more often return 95% confidence 
intervals that do not cover 0, we identify notable exceptions. Namely, Drotrecogin alfa, an anti- 
thrombotic, anti-inflammatory agent used in the treatment of severe sepsis, returns with the fourth 
largest effect estimate, but its confidence interval continues to cover 0, reflecting the high sampling 
variability in this estimate. 
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4 Discussion 



Efficient algorithmic design and massive parallelization open the door for fitting complex GLMs to 
massive datasets. Computational statisticians regularly capitalize on the sparsity of their model and 
data, and this is an important design issue for the BSCCS we consider here, since the design matrix 
X consists purely of sparse covariates. In particular, we identify that the sparsity of X carries all 
the way through to computing the subject-specific contributions to the model gradient and Hessian, 
resulting in over a 100-fold speed-up compared to the most advanced CCD algorithms for GLMs 
of which we are aware. Many GLMs, however, command dense covariates as well, such as baseline 
measurements, and other techniques become necessary. Here fusing multiple transformations and 
reduction together into vectorizable kernels is the first step in off-loading the work to the GPU, 
and we hope our discussion in this paper raises awareness of these techniques among computational 
statisticians. The end result for the BSCCS model is an approximate 30-fold speed-up on a single 
GPU compared to a single CPU core. These techniques also port directly to utilizing multi-core 
CPUs and multiple CPUs simultaneously, although we do not explore this avenue in this paper to 
simplify comparisons. 

Advancing model complexity is both possible in the sampling density of the data and in the 
prior assumptions on the unknown model parameter. Here we have only considered independent 
and identically distributed prior densities over the drug effect sizes. More biologically plausible 
hierarchical distributions are conveniently available. For example, to borrow strength, we may favor 
grouping drugs a priori into classes based on mode of action or therapeutic targets. Similarly, we 
may wish to explore borrowing strength across related outcomes. Because computation of the prior 
gradient and Hessian remains extremely light-weight, no modification to the GPU code is necessary 
and run-times should remain as quick. 

One immediate advantage of the orders-of-magnitude reduction in run-time stands the ability 
to nest point-estimation within both cross-validation and bootstrap frameworks, making these 
Monte Carlo frameworks feasible. Cross-validation and bootstrapping begin to allow us to estimate 
model hyperparameters and report measures of uncertainty around the usual point-estimates. For 
the drug surveillance community, this represents a giant leap forward. For example, most of the 
statistial methods in OMOP are implemented in the statistical packages SAS or R (http : / / omop . 
|fnih. org/MethodsLibrary). One of our own preliminary implementations of the BSCCS model 
in R, using just a sparse matrix package and no further linear algebra libraries, requires around 
5.3 hours to generate a single MAP estimate from a dataset with only N = 7460. With this 
benchmark in mind, it is no wonder why almost all computationally expensive fitting of massive 
datasets in the field has ignored cross-validation and bootstrapping; see, e.g.. Funk et al. (2011), 
and a presentation at the 2011 International Congress on Pharmacoepidemiology involving a similar 
study redoubled this point by claiming that bootstrapping is computationally infeasible with more 
than 20K patients. High performance statistical computing involving massive parallelization shows 
that these limitations are quickly lifting. 

We achieve this success by exploiting the GPU within a serial CCD algorithm. CCD is a 
generic optimization approach and we envision extensions working for massive dataset applied to 
models beyond the GLM setting as well. Moving past CCD, Zhou et al. (2010) consider similar 
block-relaxation and majorization techniques to attack large-scale matrix factorization and multi- 
dimensional scaling using CPUs. Here and in CCD, one breaks a high- dimensional optimization 



problem into a series of low-dimensional updates that involve many scalar operations. As Zhou 



et al. (2010) demonstrate with their quasi-Newton acceleration application, one ideally aims for one- 



dimensional updates, as even slightly higher-dimensional operations carry heavy data-dependency 
that can outweigh the advantages of the GPU. 
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To accomplish parallelization within a serial algorithm, we take advantage of the wide vector- 
processing capabilities of the GPU to perform simple operations simultaneously across a large 
input of data. This vectorization lacks branches in the kernel code, avoiding thread divergence and 
serialization of the work within the wide vector. As a result, expected speed-up scales most directly 
with the quantity of data. This differs considerably from distributing EP tasks, such as those 
that arise in many Monte Carlo approaches including the independent and often divergent particle 
evolution in a sequential Monte Carlo, to separate cores of the GPU. Here, we receive at little 
cost more particles and higher precision estimates with additional cores. Unfortunately, however, 
this strategy loses out on scaling in the critical dimension of the data as massively parallel devices 
continue to mushroom in size. 
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